-
Key Ideas
-
mask random patches of the input image and reconstruct the missing pixels
-
an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens
-
Implementation Details(Simple and no sparse operations needed)
-
generate a token for every input patch (by linear projection with an added positional embedding)
-
randomly shuffle the list of tokens and remove the last portion of the list, based on the masking ratio
-
append a list of mask tokens to the list of encoded patches, and unshuffle this full list (inverting the random shuffle operation) to align all tokens with their targets
-
decoder is applied to this full list (with positional embeddings added)